An Exploratory Data Analysis on Airline Customer Satisfaction

Parv Bhargava, Jehan Bugli, Venkata Madisetty, and Namratha Prakash

2023-10-20

Introduction

Airline passenger satisfaction is a crucial metric for firms in the airline industry. Understanding the factors that contribute to customer satisfaction is essential for airlines to improve their services and compete effectively; high market saturation, as well as low profit margins, can magnify the effects of small advantages or disadvantages relative to other firms (Lutz et al., 2012; Hardee, 2023). In this research, we will analyze various factors that affect airline passenger satisfaction — provided through a survey dataset — and, ultimately, judge their suitability for a regression model predicting passenger satisfaction.

Research Proposal

Our research will first look at individual variables in the aforementioned survey dataset to examine distributions and other characteristics. Then, we will identify a regression model that may be congruent with our dataset and test assumptions associated with the model.

We will leverage a Kaggle dataset that includes surveyed passenger characteristics, flight details, and satisfaction ratings for select pre-flight and in-flight components (Klein, 2020). To ensure modeling suitability, we will conduct exploratory data analysis, taking into account variable distributions and types.

SMART Questions

With our research, we aim to make progress towards answering the following questions:

  1. To what extent do certain surveyed passenger characteristics and flight experience components impact the likelihood that a passenger will be satisfied – rather than neutral or dissatisfied – with their trip?

  2. How can we model the likelihood of passenger satisfaction using surveyed passenger characteristics and flight experience components in a manner that minimizes predictive bias?

  3. To what extent can we predict the likelihood that a flight passenger will be satisfied with their experience using multiple different variable levels?

Objective

This research offers an opportunity to assess the limitations of linear regression models in predicting passenger satisfaction, specifically with regard to the categorical nature of the outcome in this dataset. Through exploratory data analysis (EDA), we can identify the characteristics of our data and subsequently illustrate why a linear regression model may not be suitable for this analysis. This will lay the groundwork for our future research on logistic regression.

In summary, our research will provide insights into the intricate relationship between passenger characteristics, flight experience, and satisfaction levels. We will also explore the limitations of linear regression models and prepare the foundation for a more advanced logistic regression approach in future analysis.
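To illustrate the concern about linear regression with a two-category outcome, the following is a minimal sketch on simulated data (not the survey data). The predictor `x`, the outcome `y`, and the slope used to generate them are assumptions made purely for illustration.

```r
# Simulated data only: one predictor and a binary outcome (1 = satisfied).
set.seed(42)
x <- seq(-4, 4, length.out = 500)
y <- rbinom(length(x), size = 1, prob = plogis(2 * x))

# Linear regression treats the 0/1 outcome as if it were continuous.
linear_fit <- lm(y ~ x)
linear_pred <- fitted(linear_fit)

# Logistic regression models the log-odds, so fitted values are probabilities.
logit_fit <- glm(y ~ x, family = binomial)
logit_pred <- fitted(logit_fit)

range(linear_pred)  # extends below 0 and above 1
range(logit_pred)   # stays within [0, 1]
```

Fitted values outside the 0 to 1 interval cannot be read as probabilities, which is one reason a logistic model is the more natural candidate for this outcome.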

Dataset Variables

The dataset for our research on airline passenger satisfaction contains various variables, which can be categorized into three types: continuous, categorical, and ordinal. In this section, we’ll list and briefly explain each of these variables.

Continuous Variables

  1. Age: This variable represents the actual age of the passengers.

  2. Flight Distance: Flight distance is the distance covered during the journey, measured in miles.

  3. Departure Delay in Minutes: This variable indicates the number of minutes by which a flight was delayed during departure.

  4. Arrival Delay in Minutes: Similarly, this variable represents the number of minutes by which a flight was delayed during arrival.

Categorical Variables

  1. Gender: Gender is a categorical variable indicating the gender of the passengers.

  2. Customer Type: The “Customer Type” variable categorizes passengers based on their customer loyalty.

  3. Type of Travel: This variable categorizes the purpose of the flight.

  4. Class: “Class” indicates the travel class in the plane.

Ordinal Variables

The following variables represent satisfaction levels, which are ordinal in nature, with values ranging from 0 to 5. According to the documentation, 0 is used to encode “Not Applicable” values.

  1. Inflight Wifi Service: Satisfaction level of the inflight wifi service.

  2. Departure/Arrival Time Convenient: Satisfaction level of departure/arrival time convenience.

  3. Ease of Online Booking: Satisfaction level of online booking.

  4. Gate Location: Satisfaction level of gate location.

  5. Food and Drink: Satisfaction level of food and drink.

  6. Online Boarding: Satisfaction level of online boarding.

  7. Seat Comfort: Satisfaction level of seat comfort.

  8. Inflight Entertainment: Satisfaction level of inflight entertainment.

  9. On-board Service: Satisfaction level of on-board service.

  10. Leg Room Service: Satisfaction level of leg room service.

  11. Baggage Handling: Satisfaction level of baggage handling.

  12. Check-in Service: Satisfaction level of check-in service.

  13. Inflight Service: Satisfaction level of inflight service.

  14. Cleanliness: Satisfaction level of cleanliness.

Target Variable

  • Satisfaction: The “Satisfaction” variable represents the airline passenger’s satisfaction level and includes two categories: “satisfied” or “neutral or dissatisfied.” This will be our primary outcome variable for analysis.

In our research, we will explore how these variables interact and contribute to passenger satisfaction levels. We will use statistical methods and modeling techniques to gain insights into the factors that lead to customer satisfaction for an airline.

Variable limitations

While the opportunities for analysis and insight generation are manifold, certain fields in this dataset present challenges that can limit a resulting model’s predictive validity. These include:

  • Data collection: this dataset was sourced from Kaggle (Klein, 2020). While some variable-related documentation is available, we are not able to discern the circumstances under which this survey was distributed. The population may have been sampled through certain methods—such as convenience sampling—that make resulting data less representative of the overall population despite the large observation count. The overall population in question also is not clear; the survey may have focused on a particular airport or region, limiting potential predictive validity in alternative settings.

  • Loyal/disloyal clarity: the dataset documentation does not elaborate upon what counts as a “loyal” or “disloyal” customer for that field. This makes it difficult to properly interpret the effects of such a variable in a regression model. The threshold for disloyalty could potentially range from using any other airlines at all to using other airlines a majority of the time, drastically altering any potential real-world applications.

  • Ticket prices: ticket prices are not included in this survey, with class serving as a rough proxy; intuitively, such prices could play a major factor in passengers’ service expectations and their subsequent ratings. The lack of price ranges associated with seat class also makes it difficult to encode the three categories in a way that accurately captures the disparity.

Loading the Data

We first imported the data into R using the read.csv() function. The first few rows of the dataset are shown below.

Head
X | id | Gender | Customer.Type | Age | Type.of.Travel | Class | Flight.Distance | Inflight.wifi.service | Departure.Arrival.time.convenient | Ease.of.Online.booking | Gate.location | Food.and.drink | Online.boarding | Seat.comfort | Inflight.entertainment | On.board.service | Leg.room.service | Baggage.handling | Checkin.service | Inflight.service | Cleanliness | Departure.Delay.in.Minutes | Arrival.Delay.in.Minutes | satisfaction
0 | 70172 | Male | Loyal Customer | 13 | Personal Travel | Eco Plus | 460 | 3 | 4 | 3 | 1 | 5 | 3 | 5 | 5 | 4 | 3 | 4 | 4 | 5 | 5 | 25 | 18 | neutral or dissatisfied
1 | 5047 | Male | disloyal Customer | 25 | Business travel | Business | 235 | 3 | 2 | 3 | 3 | 1 | 3 | 1 | 1 | 1 | 5 | 3 | 1 | 4 | 1 | 1 | 6 | neutral or dissatisfied
2 | 110028 | Female | Loyal Customer | 26 | Business travel | Business | 1142 | 2 | 2 | 2 | 2 | 5 | 5 | 5 | 5 | 4 | 3 | 4 | 4 | 4 | 5 | 0 | 0 | satisfied
3 | 24026 | Female | Loyal Customer | 25 | Business travel | Business | 562 | 2 | 5 | 5 | 5 | 2 | 2 | 2 | 2 | 2 | 5 | 3 | 1 | 4 | 2 | 11 | 9 | neutral or dissatisfied
4 | 119299 | Male | Loyal Customer | 61 | Business travel | Business | 214 | 3 | 3 | 3 | 3 | 4 | 5 | 5 | 3 | 3 | 4 | 4 | 3 | 3 | 3 | 0 | 0 | satisfied
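The import step can be sketched as follows. Because the actual file path is not given in this report, the example writes a tiny stand-in CSV to a temporary file; the path and the reduced set of columns are assumptions for illustration only.

```r
# Hypothetical stand-in for the survey file: a two-row CSV written to a
# temporary location so that the sketch is self-contained.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  ",id,Gender,Customer Type,Age,satisfaction",
  "0,70172,Male,Loyal Customer,13,neutral or dissatisfied",
  "1,5047,Male,disloyal Customer,25,neutral or dissatisfied"
), csv_path)

# With the default check.names = TRUE, read.csv() replaces spaces in column
# names with dots and gives a blank header field the name "X".
airline <- read.csv(csv_path)

head(airline, 5)   # first rows of the data frame
names(airline)     # column names after cleaning
```

This default name cleaning would explain the dotted column names and the leading X index column seen in the output above.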

Checking data structure and dimensions

Data structure

## 'data.frame':    103904 obs. of  25 variables:
##  $ X                                : int  0 1 2 3 4 5 6 7 8 9 ...
##  $ id                               : int  70172 5047 110028 24026 119299 111157 82113 96462 79485 65725 ...
##  $ Gender                           : chr  "Male" "Male" "Female" "Female" ...
##  $ Customer.Type                    : chr  "Loyal Customer" "disloyal Customer" "Loyal Customer" "Loyal Customer" ...
##  $ Age                              : int  13 25 26 25 61 26 47 52 41 20 ...
##  $ Type.of.Travel                   : chr  "Personal Travel" "Business travel" "Business travel" "Business travel" ...
##  $ Class                            : chr  "Eco Plus" "Business" "Business" "Business" ...
##  $ Flight.Distance                  : int  460 235 1142 562 214 1180 1276 2035 853 1061 ...
##  $ Inflight.wifi.service            : int  3 3 2 2 3 3 2 4 1 3 ...
##  $ Departure.Arrival.time.convenient: int  4 2 2 5 3 4 4 3 2 3 ...
##  $ Ease.of.Online.booking           : int  3 3 2 5 3 2 2 4 2 3 ...
##  $ Gate.location                    : int  1 3 2 5 3 1 3 4 2 4 ...
##  $ Food.and.drink                   : int  5 1 5 2 4 1 2 5 4 2 ...
##  $ Online.boarding                  : int  3 3 5 2 5 2 2 5 3 3 ...
##  $ Seat.comfort                     : int  5 1 5 2 5 1 2 5 3 3 ...
##  $ Inflight.entertainment           : int  5 1 5 2 3 1 2 5 1 2 ...
##  $ On.board.service                 : int  4 1 4 2 3 3 3 5 1 2 ...
##  $ Leg.room.service                 : int  3 5 3 5 4 4 3 5 2 3 ...
##  $ Baggage.handling                 : int  4 3 4 3 4 4 4 5 1 4 ...
##  $ Checkin.service                  : int  4 1 4 1 3 4 3 4 4 4 ...
##  $ Inflight.service                 : int  5 4 4 4 3 4 5 5 1 3 ...
##  $ Cleanliness                      : int  5 1 5 2 3 1 2 4 2 2 ...
##  $ Departure.Delay.in.Minutes       : int  25 1 0 11 0 0 9 4 0 0 ...
##  $ Arrival.Delay.in.Minutes         : num  18 6 0 9 0 0 23 0 0 0 ...
##  $ satisfaction                     : chr  "neutral or dissatisfied" "neutral or dissatisfied" "satisfied" "neutral or dissatisfied" ...
  • X and id: These columns represent some unique identifiers for each observation. X appears to be an integer index, while id is also an integer and likely represents a customer ID or some form of identifier.

  • Gender: This column contains information about the gender of the passengers, with values such as “Male” and “Female.”

  • Customer.Type: This variable describes the customer as a “Loyal Customer” or a “disloyal Customer.”

  • Age: Represents the age of the passengers and is an integer variable.

  • Type.of.Travel: Indicates the purpose of travel with two levels, “Personal Travel” and “Business travel.”

  • Class: Specifies the class of travel with three levels: “Business,” “Eco,” and “Eco Plus.”

  • Flight.Distance: This variable contains the distance of the flight in miles as an integer.

  • Inflight.wifi.service, Departure.Arrival.time.convenient, and several other columns: These variables seem to represent passengers’ ratings or feedback on different aspects of their flight experience. They are integer variables with ratings ranging from 0 to 5.

  • Departure.Delay.in.Minutes and Arrival.Delay.in.Minutes: These columns represent the delay in minutes for departure and arrival, respectively. Departure delay is an integer, while arrival delay is a numeric variable; this runs counter to our initial expectation that both delay columns would share a type. A likely culprit is a discrepancy in how decimal values were recorded for arrival delays, although the values displayed here are all whole numbers.

  • satisfaction: This is the target variable or the outcome of interest, and it represents customer satisfaction levels with values like “neutral or dissatisfied” and “satisfied.”
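One way to probe the integer versus numeric discrepancy in the delay columns is to check whether every recorded arrival delay is a whole number. The sketch below uses made-up values; the idea that the source file stores delays as text like "18.0" is an assumption, not something confirmed by the documentation.

```r
# Illustrative values only; NA stands in for a missing arrival delay.
departure_delay <- c(25L, 1L, 0L, 11L)
arrival_delay <- c(18, 6, NA, 9)

class(departure_delay)
class(arrival_delay)

# TRUE when every non-missing value is a whole number, meaning the column
# could be stored as integer without losing information.
is_whole <- all(arrival_delay == floor(arrival_delay), na.rm = TRUE)
is_whole

# Assumption: if the file stores delays as "18.0" rather than "18",
# R's type conversion yields a double even though the values are whole.
parsed <- type.convert(c("18.0", "6.0", NA), as.is = TRUE)
class(parsed)

# Optional conversion once whole-number status is confirmed.
arrival_delay_int <- as.integer(arrival_delay)
```

If the check returns TRUE on the real column, the type difference is a storage artifact rather than evidence of fractional delays.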

Data dimensions

This is a data frame with 103904 observations (rows) and 25 variables (columns). Assuming that a robust sampling method was utilized, the large number of observations may allow us to conclude that the data is generally representative of the actual population.

An initial description of the data

## data 
## 
##  25  Variables      103904  Observations
## ------------------------------------------------------------
## X 
##        n  missing distinct     Info     Mean      Gmd 
##   103904        0   103904        1    51952    34635 
##      .05      .10      .25      .50      .75      .90 
##     5195    10390    25976    51952    77927    93513 
##      .95 
##    98708 
## 
## lowest :      0      1      2      3      4
## highest: 103899 103900 103901 103902 103903
## ------------------------------------------------------------
## id 
##        n  missing distinct     Info     Mean      Gmd 
##   103904        0   103904        1    64924    43260 
##      .05      .10      .25      .50      .75      .90 
##     6593    13044    32534    64857    97368   116884 
##      .95 
##   123410 
## 
## lowest :      1      2      3      4      5
## highest: 129874 129875 129878 129879 129880
## ------------------------------------------------------------
## Gender 
##        n  missing distinct 
##   103904        0        2 
##                         
## Value      Female   Male
## Frequency   52727  51177
## Proportion  0.507  0.493
## ------------------------------------------------------------
## Customer.Type 
##        n  missing distinct 
##   103904        0        2 
##                                               
## Value      disloyal Customer    Loyal Customer
## Frequency              18981             84923
## Proportion             0.183             0.817
## ------------------------------------------------------------
## Age 
##        n  missing distinct     Info     Mean      Gmd 
##   103904        0       75        1    39.38    17.32 
##      .05      .10      .25      .50      .75      .90 
##       14       20       27       40       51       59 
##      .95 
##       64 
## 
## lowest :  7  8  9 10 11, highest: 77 78 79 80 85
## ------------------------------------------------------------
## Type.of.Travel 
##        n  missing distinct 
##   103904        0        2 
##                                           
## Value      Business travel Personal Travel
## Frequency            71655           32249
## Proportion            0.69            0.31
## ------------------------------------------------------------
## Class 
##        n  missing distinct 
##   103904        0        3 
##                                      
## Value      Business      Eco Eco Plus
## Frequency     49665    46745     7494
## Proportion    0.478    0.450    0.072
## ------------------------------------------------------------
## Flight.Distance 
##        n  missing distinct     Info     Mean      Gmd 
##   103904        0     3802        1     1189     1066 
##      .05      .10      .25      .50      .75      .90 
##      175      236      414      843     1743     2750 
##      .95 
##     3383 
## 
## lowest :   31   56   67   73   74, highest: 4243 4502 4817 4963 4983
## ------------------------------------------------------------
## Inflight.wifi.service 
##        n  missing distinct     Info     Mean      Gmd 
##   103904        0        6    0.956     2.73    1.492 
##                                               
## Value          0     1     2     3     4     5
## Frequency   3103 17840 25830 25868 19794 11469
## Proportion 0.030 0.172 0.249 0.249 0.191 0.110
## 
## For the frequency table, variable is rounded to the nearest 0
## ------------------------------------------------------------
## Departure.Arrival.time.convenient 
##        n  missing distinct     Info     Mean      Gmd 
##   103904        0        6    0.962     3.06    1.716 
##                                               
## Value          0     1     2     3     4     5
## Frequency   5300 15498 17191 17966 25546 22403
## Proportion 0.051 0.149 0.165 0.173 0.246 0.216
## 
## For the frequency table, variable is rounded to the nearest 0
## ------------------------------------------------------------
## Ease.of.Online.booking 
##        n  missing distinct     Info     Mean      Gmd 
##   103904        0        6    0.961    2.757    1.578 
##                                               
## Value          0     1     2     3     4     5
## Frequency   4487 17525 24021 24449 19571 13851
## Proportion 0.043 0.169 0.231 0.235 0.188 0.133
## 
## For the frequency table, variable is rounded to the nearest 0
## ------------------------------------------------------------
## Gate.location 
##        n  missing distinct     Info     Mean      Gmd 
##   103904        0        6    0.952    2.977    1.437 
##                                               
## Value          0     1     2     3     4     5
## Frequency      1 17562 19459 28577 24426 13879
## Proportion 0.000 0.169 0.187 0.275 0.235 0.134
## 
## For the frequency table, variable is rounded to the nearest 0
## ------------------------------------------------------------
## Food.and.drink 
##        n  missing distinct     Info     Mean      Gmd 
##   103904        0        6    0.956    3.202    1.499 
##                                               
## Value          0     1     2     3     4     5
## Frequency    107 12837 21988 22300 24359 22313
## Proportion 0.001 0.124 0.212 0.215 0.234 0.215
## 
## For the frequency table, variable is rounded to the nearest 0
## ------------------------------------------------------------
## Online.boarding 
##        n  missing distinct     Info     Mean      Gmd 
##   103904        0        6    0.951     3.25    1.501 
##                                               
## Value          0     1     2     3     4     5
## Frequency   2428 10692 17505 21804 30762 20713
## Proportion 0.023 0.103 0.168 0.210 0.296 0.199
## 
## For the frequency table, variable is rounded to the nearest 0
## ------------------------------------------------------------
## Seat.comfort 
##        n  missing distinct     Info     Mean      Gmd 
##   103904        0        6    0.945    3.439    1.462 
##                                               
## Value          0     1     2     3     4     5
## Frequency      1 12075 14897 18696 31765 26470
## Proportion 0.000 0.116 0.143 0.180 0.306 0.255
## 
## For the frequency table, variable is rounded to the nearest 0
## ------------------------------------------------------------
## Inflight.entertainment 
##        n  missing distinct     Info     Mean      Gmd 
##   103904        0        6     0.95    3.358     1.49 
##                                               
## Value          0     1     2     3     4     5
## Frequency     14 12478 17637 19139 29423 25213
## Proportion 0.000 0.120 0.170 0.184 0.283 0.243
## 
## For the frequency table, variable is rounded to the nearest 0
## ------------------------------------------------------------
## On.board.service 
##        n  missing distinct     Info     Mean      Gmd 
##   103904        0        6    0.947    3.382    1.433 
##                                               
## Value          0     1     2     3     4     5
## Frequency      3 11872 14681 22833 30867 23648
## Proportion 0.000 0.114 0.141 0.220 0.297 0.228
## 
## For the frequency table, variable is rounded to the nearest 0
## ------------------------------------------------------------
## Leg.room.service 
##        n  missing distinct     Info     Mean      Gmd 
##   103904        0        6     0.95    3.351    1.471 
##                                               
## Value          0     1     2     3     4     5
## Frequency    472 10353 19525 20098 28789 24667
## Proportion 0.005 0.100 0.188 0.193 0.277 0.237
## 
## For the frequency table, variable is rounded to the nearest 0
## ------------------------------------------------------------
## Baggage.handling 
##        n  missing distinct     Info     Mean      Gmd 
##   103904        0        5    0.926    3.632    1.282 
##                                         
## Value          1     2     3     4     5
## Frequency   7237 11521 20632 37383 27131
## Proportion 0.070 0.111 0.199 0.360 0.261
## 
## For the frequency table, variable is rounded to the nearest 0
## ------------------------------------------------------------
## Checkin.service 
##        n  missing distinct     Info     Mean      Gmd 
##   103904        0        6    0.946    3.304    1.408 
##                                               
## Value          0     1     2     3     4     5
## Frequency      1 12890 12893 28446 29055 20619
## Proportion 0.000 0.124 0.124 0.274 0.280 0.198
## 
## For the frequency table, variable is rounded to the nearest 0
## ------------------------------------------------------------
## Inflight.service 
##        n  missing distinct     Info     Mean      Gmd 
##   103904        0        6    0.924     3.64    1.274 
##                                               
## Value          0     1     2     3     4     5
## Frequency      3  7084 11457 20299 37945 27116
## Proportion 0.000 0.068 0.110 0.195 0.365 0.261
## 
## For the frequency table, variable is rounded to the nearest 0
## ------------------------------------------------------------
## Cleanliness 
##        n  missing distinct     Info     Mean      Gmd 
##   103904        0        6    0.953    3.286    1.471 
##                                               
## Value          0     1     2     3     4     5
## Frequency     12 13318 16132 24574 27179 22689
## Proportion 0.000 0.128 0.155 0.237 0.262 0.218
## 
## For the frequency table, variable is rounded to the nearest 0
## ------------------------------------------------------------
## Departure.Delay.in.Minutes 
##        n  missing distinct     Info     Mean      Gmd 
##   103904        0      446     0.82    14.82    24.68 
##      .05      .10      .25      .50      .75      .90 
##        0        0        0        0       12       44 
##      .95 
##       78 
## 
## lowest :    0    1    2    3    4, highest:  933  978 1017 1305 1592
## ------------------------------------------------------------
## Arrival.Delay.in.Minutes 
##        n  missing distinct     Info     Mean      Gmd 
##   103594      310      455    0.823    15.18    25.15 
##      .05      .10      .25      .50      .75      .90 
##        0        0        0        0       13       44 
##      .95 
##       79 
## 
## lowest :    0    1    2    3    4, highest:  952  970 1011 1280 1584
## ------------------------------------------------------------
## satisfaction 
##        n  missing distinct 
##   103904        0        2 
##                                                           
## Value      neutral or dissatisfied               satisfied
## Frequency                    58879                   45025
## Proportion                   0.567                   0.433
## ------------------------------------------------------------
  1. Variables X and id:
    • Variable ‘X’ is an integer index ranging from 0 to 103903 with no missing values.
    • Variable ‘id’ is an integer identifier, likely a customer or survey ID, ranging from 1 to 129880 with no missing values.
  2. Gender:
    • There are two distinct values, ‘Female’ and ‘Male,’ with roughly equal proportions of female (50.7%) and male (49.3%) passengers.
  3. Customer Type:
    • Two distinct types of customers are present: ‘disloyal Customer’ and ‘Loyal Customer.’ ‘Loyal Customer’ is the dominant type, accounting for approximately 81.7% of passengers.
  4. Age:
    • The age variable ranges from 7 to 85 with a mean age of approximately 39.38. 50% of the respondents’ ages fall between 27 and 51.
  5. Type of Travel:
    • There are two types of travel: ‘Business travel’ (69.0%) and ‘Personal Travel’ (31.0%). Business travel is the more common type by far.
  6. Class:
    • Three distinct classes are available: ‘Business,’ ‘Eco,’ and ‘Eco Plus.’
    • ‘Business’ class is the most popular (47.8%), followed by ‘Eco’ (45.0%) and ‘Eco Plus’ (7.2%).
  7. Flight Distance:
    • The mean flight distance is approximately 1189 miles, with values ranging from 31 to 4983 miles; the middle 90% of flights fall between 175 and 3383 miles.
  8. Inflight Wifi Service, Departure Arrival Time Convenient, Ease of Online Booking, Gate Location, Food and Drink, Online Boarding, Seat Comfort, Inflight Entertainment, On-Board Service, Legroom Service, Baggage Handling, Check-In Service, Inflight Service, and Cleanliness:
    • These variables represent passengers’ ratings on a scale from 0 to 5 for various aspects of their flight experience.
    • The mean ratings for each of these variables fall between 2.73 and 3.64.
    • 4 appears to be the most commonly selected option for most individual ratings.
  9. Departure Delay in Minutes:
    • At least half of flights have no departure delay (median of 0 minutes), although the mean delay is 14.82 minutes.
    • Delays range from 0 to 1592 minutes; 95% of flights are delayed by 78 minutes or less.
  10. Arrival Delay in Minutes:
    • Arrival delays are similar to departure delays, with at least half of flights having no delay (median of 0 minutes, mean of 15.18 minutes).
    • Delays range from 0 to 1584 minutes; 95% of flights are delayed by 79 minutes or less. This is the only variable with missing values (310).
  11. Satisfaction:
    • There are two categories of satisfaction: ‘neutral or dissatisfied’ (56.7%) and ‘satisfied’ (43.3%).
    • Overall, more passengers appear to be ‘neutral or dissatisfied’ with their flight experience.
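The description output above lists percentiles (.05 through .95) separately from the lowest and highest values, and the two are easy to confuse for right-skewed variables. The following minimal sketch, on simulated delays rather than the survey data, shows how far apart they can be.

```r
set.seed(1)
# Simulated right-skewed delays in minutes (not the survey data).
delay <- round(rexp(10000, rate = 1 / 15))

quantile(delay, probs = c(0.05, 0.50, 0.95))  # 5th, 50th, 95th percentiles
range(delay)                                   # true minimum and maximum

p95 <- unname(quantile(delay, 0.95))
max_delay <- max(delay)
```

For skewed data like this, the maximum sits well above the 95th percentile, so the range should be read from the lowest and highest values rather than from the percentile rows.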

Data Pre-processing

Duplicate values

The dataset contains no duplicate observations.
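A duplicate check of this kind can be sketched with base R's duplicated(), which flags rows that repeat an earlier row exactly. The toy data frame below is hypothetical.

```r
# Toy data frame with one exact repeat (rows 2 and 3).
toy <- data.frame(id = c(1, 2, 2, 3), rating = c(4, 5, 5, 3))

# duplicated() marks each row that matches an earlier row in every column.
n_duplicates <- sum(duplicated(toy))
n_duplicates

# Dropping flagged rows keeps the first occurrence of each observation.
deduplicated <- toy[!duplicated(toy), ]
```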

Missing Values

The following table shows the NA values in our dataset:
Missing Values — Initial
X id Gender Customer.Type Age Type.of.Travel Class Flight.Distance Inflight.wifi.service Departure.Arrival.time.convenient Ease.of.Online.booking Gate.location Food.and.drink Online.boarding Seat.comfort Inflight.entertainment On.board.service Leg.room.service Baggage.handling Checkin.service Inflight.service Cleanliness Departure.Delay.in.Minutes Arrival.Delay.in.Minutes satisfaction
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 310 0
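A per-column count of missing values like the one above can be produced in one line; this is a sketch on a toy data frame, and the exact call used for the table is not shown in this report.

```r
# Toy data with missing arrival delays only.
toy <- data.frame(
  Departure.Delay.in.Minutes = c(25, 1, 0, 11),
  Arrival.Delay.in.Minutes = c(18, NA, 0, NA)
)

# is.na() returns a logical matrix; colSums() counts TRUE values per column.
na_counts <- colSums(is.na(toy))
na_counts
```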

We elected to replace these 310 NA values in arrival delays with the median delay; this method was used over other potential replacement options, such as the average, due to the skewed distribution of values detailed later on.
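Median replacement can be sketched as follows, using made-up delay values with one large outlier to show why the median is less sensitive to skew than the mean.

```r
# Made-up arrival delays: mostly small, one extreme value, two missing.
arrival_delay <- c(0, 0, 0, 5, 12, NA, 300, NA)

delay_median <- median(arrival_delay, na.rm = TRUE)
delay_mean <- mean(arrival_delay, na.rm = TRUE)

# Replace only the missing entries; observed values are left untouched.
imputed <- ifelse(is.na(arrival_delay), delay_median, arrival_delay)
imputed
```

Here the single 300-minute delay pulls the mean far above the typical value, while the median stays close to what most observations look like.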

The table below demonstrates that all missing values have been replaced; the “X” and “id” fields for index number and survey ID are also removed from the data frame due to their limited relevance for modeling.

Responses for the ratings variables are coded as values from 1 to 5. However, some responses include 0; as noted earlier, this indicates that the question was not applicable. Respondents who selected this option for any of the ratings variables are filtered out to ensure that all of the individual ratings are relevant for all observations, leaving 95704 observations. While alternatives exist, such as replacement, the large number of initial observations limited our concerns over a potential loss in predictive validity.
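The two cleaning steps described here, dropping the identifier columns and filtering out rows with a 0 rating, might look like the sketch below. The toy data frame and the two-column `rating_cols` vector are stand-ins; the real data has fourteen ratings columns.

```r
toy <- data.frame(
  X = 0:3,
  id = c(11, 12, 13, 14),
  Inflight.wifi.service = c(3, 0, 4, 5),
  Cleanliness = c(5, 2, 0, 4),
  satisfaction = c("satisfied", "neutral or dissatisfied",
                   "satisfied", "satisfied")
)
rating_cols <- c("Inflight.wifi.service", "Cleanliness")

# Drop the index and identifier columns.
toy <- toy[, !(names(toy) %in% c("X", "id"))]

# Keep only rows where no rating equals 0 ("Not Applicable").
keep <- rowSums(toy[, rating_cols] == 0) == 0
filtered <- toy[keep, ]
filtered
```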

Missing Values — Final
Gender Customer.Type Age Type.of.Travel Class Flight.Distance Inflight.wifi.service Departure.Arrival.time.convenient Ease.of.Online.booking Gate.location Food.and.drink Online.boarding Seat.comfort Inflight.entertainment On.board.service Leg.room.service Baggage.handling Checkin.service Inflight.service Cleanliness Departure.Delay.in.Minutes Arrival.Delay.in.Minutes satisfaction
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

Summary Statistics

The following output features summary statistics for the continuous variables:

summary_stats_numeric
##       Age       Flight.Distance Departure.Delay.in.Minutes
##  Min.   : 7.0   Min.   :  31    Min.   :   0              
##  1st Qu.:28.0   1st Qu.: 438    1st Qu.:   0              
##  Median :40.0   Median : 867    Median :   0              
##  Mean   :39.8   Mean   :1222    Mean   :  15              
##  3rd Qu.:51.0   3rd Qu.:1773    3rd Qu.:  13              
##  Max.   :85.0   Max.   :4983    Max.   :1592              
##  Arrival.Delay.in.Minutes
##  Min.   :   0            
##  1st Qu.:   0            
##  Median :   0            
##  Mean   :  15            
##  3rd Qu.:  13            
##  Max.   :1584

The following output features summary statistics for the categorical/ordinal variables:

summary_stats_categorical
##   Gender_n Gender_n_distinct Gender_top_freq
## 1    95704                 2          Female
##   Customer.Type_n Customer.Type_n_distinct
## 1           95704                        2
##   Customer.Type_top_freq Type.of.Travel_n
## 1         Loyal Customer            95704
##   Type.of.Travel_n_distinct Type.of.Travel_top_freq Class_n
## 1                         2         Business travel   95704
##   Class_n_distinct Class_top_freq Inflight.wifi.service_n
## 1                3       Business                   95704
##   Inflight.wifi.service_n_distinct
## 1                                5
##   Inflight.wifi.service_top_freq
## 1                              3
##   Departure.Arrival.time.convenient_n
## 1                               95704
##   Departure.Arrival.time.convenient_n_distinct
## 1                                            5
##   Departure.Arrival.time.convenient_top_freq
## 1                                          4
##   Ease.of.Online.booking_n
## 1                    95704
##   Ease.of.Online.booking_n_distinct
## 1                                 5
##   Ease.of.Online.booking_top_freq Gate.location_n
## 1                               3           95704
##   Gate.location_n_distinct Gate.location_top_freq
## 1                        5                      3
##   Food.and.drink_n Food.and.drink_n_distinct
## 1            95704                         5
##   Food.and.drink_top_freq Online.boarding_n
## 1                       4             95704
##   Online.boarding_n_distinct Online.boarding_top_freq
## 1                          5                        4
##   Seat.comfort_n Seat.comfort_n_distinct
## 1          95704                       5
##   Seat.comfort_top_freq Inflight.entertainment_n
## 1                     4                    95704
##   Inflight.entertainment_n_distinct
## 1                                 5
##   Inflight.entertainment_top_freq On.board.service_n
## 1                               4              95704
##   On.board.service_n_distinct On.board.service_top_freq
## 1                           5                         4
##   Leg.room.service_n Leg.room.service_n_distinct
## 1              95704                           5
##   Leg.room.service_top_freq Baggage.handling_n
## 1                         4              95704
##   Baggage.handling_n_distinct Baggage.handling_top_freq
## 1                           5                         4
##   Checkin.service_n Checkin.service_n_distinct
## 1             95704                          5
##   Checkin.service_top_freq Inflight.service_n
## 1                        4              95704
##   Inflight.service_n_distinct Inflight.service_top_freq
## 1                           5                         4
##   Cleanliness_n Cleanliness_n_distinct Cleanliness_top_freq
## 1         95704                      5                    4
##   satisfaction_n satisfaction_n_distinct
## 1          95704                       2
##     satisfaction_top_freq
## 1 neutral or dissatisfied
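Statistics of the kind shown above (count, number of distinct values, most frequent value) can be computed with a small helper. The function name `summarise_categorical` and the toy data are hypothetical; the original summary may have been produced with a different tool.

```r
# Hypothetical helper: count, distinct values, and modal value for one column.
summarise_categorical <- function(x) {
  counts <- table(x)
  list(
    n = sum(!is.na(x)),
    n_distinct = length(unique(x[!is.na(x)])),
    top_freq = names(counts)[which.max(counts)]
  )
}

toy <- data.frame(
  Class = c("Business", "Eco", "Business", "Eco Plus"),
  Seat.comfort = c(4, 4, 5, 2)
)

# Apply the helper to every column of the data frame.
stats <- lapply(toy, summarise_categorical)
stats$Class
stats$Seat.comfort
```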

Examining variable distributions

Frequency distributions for categorical variables

The plots above provide visual representations for the summary statistics detailed earlier. While none initially appear to be highly correlated, we intend to confirm this using variance inflation factor (VIF) analysis at a later time once our model is fleshed out (“vif: Variance Inflation Factors”, n.d.).
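For reference, the variance inflation factor for a predictor j is 1 / (1 - R²_j), where R²_j comes from regressing that predictor on all the others. The sketch below computes it by hand on simulated continuous predictors; the variable names are borrowed from the survey for readability, but the values are synthetic, and a packaged vif function would normally be applied to the fitted model instead.

```r
set.seed(7)
n <- 200
seat_comfort <- rnorm(n)
cleanliness <- seat_comfort + rnorm(n, sd = 0.3)  # deliberately correlated
gate_location <- rnorm(n)                          # independent of the others
predictors <- data.frame(seat_comfort, cleanliness, gate_location)

# VIF_j = 1 / (1 - R^2_j), with R^2_j from regressing predictor j
# on all remaining predictors.
manual_vif <- sapply(names(predictors), function(name) {
  others <- setdiff(names(predictors), name)
  fit <- lm(reformulate(others, response = name), data = predictors)
  1 / (1 - summary(fit)$r.squared)
})
manual_vif
```

Values near 1 indicate little shared variance with the other predictors, while large values flag predictors that are nearly linear combinations of the rest.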

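Provided that a robust sampling method was used, these distributions (including the highly skewed ones) can reasonably be treated as representative of the overall population; as noted under variable limitations, however, the sampling method itself is not documented.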

Looking at the distribution of class, Eco Plus has a significantly lower observation frequency than the other two. In addition, as noted earlier, the magnitudes of increments between Eco, Eco Plus, and Business are not clear; some transformation may be required later to ensure modeling suitability.

Frequency distributions for continuous variables

From the graphs above, flight distance as well as both delay variables have a strongly right-skewed distribution. This makes sense intuitively; we would expect most flights to have minimal to no delays, and shorter flights are likely more frequent.

Age is the only variable that somewhat approximates a normal distribution (although that cannot be safely assumed); the current graph appears to be bimodal to a degree, with a small peak around 20-25 and another peak roughly around 35-50.

Depending on the type of regression that is ultimately selected, some of these variables may require aggressive transformations to better approximate normal distributions.
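As a rough illustration of the kind of transformation that might be considered, the sketch below applies a log transform to a simulated right-skewed variable. The data are generated rather than taken from the survey, the `skewness` helper is a hypothetical convenience function, and `log1p` is only one of several candidate transformations.

```r
# Simulated stand-in for a right-skewed delay variable (not the survey data).
set.seed(42)
delay_minutes <- round(rexp(1000, rate = 1 / 15))

# Simple moment-based skewness; base R has no built-in skewness function.
skewness <- function(x) {
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

# log1p(x) computes log(1 + x), which stays defined when a delay is zero.
log_delay <- log1p(delay_minutes)

skew_before <- skewness(delay_minutes)
skew_after <- skewness(log_delay)
print(c(before = skew_before, after = skew_after))
```

A skewness value closer to zero after the transformation suggests a more symmetric distribution, although it does not by itself establish normality.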

Frequency distributions for ordinal variables (Ratings)

Departure/Arrival time convenient, Food and drink, Online boarding, Seat comfort, Inflight entertainment, On-board service, Leg room service, Baggage handling, Checkin service, Inflight service, and Cleanliness all have a mode value of 4. Inflight wifi service, Gate location, and Ease of online booking all have a mode value of 3. Many of the distributions for individual ratings variables look quite similar, raising multicollinearity concerns that will be addressed later.

Distribution of continuous variable features by satisfaction - KDE (Kernel Density Estimation)

 

Observations


Age: Middle-aged passengers tend to exhibit higher levels of satisfaction than both younger and older age groups, with the satisfied distribution peaking around 40-50 years of age. Meanwhile, the distribution of neutral/dissatisfied passengers peaks noticeably earlier. If age proves to be a significant factor, airlines could use this to target improvements at specific age groups.

Flight Distance: Passengers traveling shorter distances appear to be more inclined towards neutrality or dissatisfaction compared to those embarking on longer journeys. This insight suggests that there might be unique challenges or aspects of shorter flights that influence passenger contentment and warrant further investigation.

Arrival/Departure Delays: It is difficult to discern any meaningful differences between satisfied and neutral/dissatisfied passengers based on arrival or departure delay durations using this method. To expand upon these visuals and potentially reveal more, we used a scatter plot.


Visualizing the relationship between Arrival and Departure delays colored by satisfaction.

This graph indicates that arrival and departure delays follow a roughly linear relationship with one another, potentially indicating high correlation between these fields.

Multicollinearity Testing

One of the essential steps in data analysis is assessing multicollinearity among independent variables. Multicollinearity occurs when predictor variables are highly correlated with each other, which can impact the reliability of regression models.
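One common way to quantify multicollinearity is the variance inflation factor, which this report defers to a later stage. The sketch below computes it by hand on simulated predictors, assuming the textbook definition VIF = 1 / (1 - R²); the variable names `x1`, `x2`, and `x3` are hypothetical and the data are not drawn from the survey.

```r
# Simulated predictors (hypothetical, not the survey data): x2 is built to
# track x1 closely, while x3 is independent of both.
set.seed(1)
n <- 500
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.3)
x3 <- rnorm(n)
predictors <- data.frame(x1, x2, x3)

# VIF for predictor j is 1 / (1 - R^2_j), where R^2_j comes from regressing
# predictor j on all of the other predictors.
manual_vif <- sapply(names(predictors), function(target) {
  others <- setdiff(names(predictors), target)
  fit <- lm(reformulate(others, response = target), data = predictors)
  1 / (1 - summary(fit)$r.squared)
})
print(round(manual_vif, 2))
```

In this simulated setup the two related predictors should show clearly inflated values, while the independent one should stay close to 1.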

Correlation Matrices

To begin examining fields with respect to multicollinearity, we used two correlation matrices:

  1. Continuous variables

  2. Ratings variables

Continuous Variable Correlations

##       Age         Flight.Distance 
##  Min.   :-0.016   Min.   :-0.004  
##  1st Qu.:-0.014   1st Qu.:-0.001  
##  Median : 0.035   Median : 0.042  
##  Mean   : 0.264   Mean   : 0.270  
##  3rd Qu.: 0.312   3rd Qu.: 0.312  
##  Max.   : 1.000   Max.   : 1.000  
##  Departure.Delay.in.Minutes Arrival.Delay.in.Minutes
##  Min.   :-0.013             Min.   :-0.016          
##  1st Qu.:-0.003             1st Qu.:-0.007          
##  Median : 0.480             Median : 0.478          
##  Mean   : 0.487             Mean   : 0.485          
##  3rd Qu.: 0.970             3rd Qu.: 0.970          
##  Max.   : 1.000             Max.   : 1.000

As observed earlier, arrival and departure delays appear to be highly correlated; certain steps, such as removing one of the two or calculating an average delay variable, would likely be necessary for use in a predictive model.
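The two remedies mentioned above can be sketched as follows. The data are simulated so that the delay variables are strongly correlated, and the column names (`departure_delay`, `arrival_delay`, `avg_delay`) are hypothetical rather than those of the survey dataset.

```r
# Simulated delays (hypothetical values): arrival delay is departure delay
# plus a small amount of noise, so the two are strongly correlated.
set.seed(7)
n <- 1000
departure_delay <- rexp(n, rate = 1 / 15)
arrival_delay <- pmax(departure_delay + rnorm(n, sd = 5), 0)
age <- sample(18:80, n, replace = TRUE)
flights <- data.frame(age, departure_delay, arrival_delay)

# Pairwise Pearson correlations among the continuous variables.
cor_matrix <- cor(flights)
print(round(cor_matrix, 3))

# Option 1: keep only one of the two delay variables.
flights_single <- flights[, c("age", "departure_delay")]

# Option 2: replace both delay variables with their row-wise average.
flights$avg_delay <- rowMeans(flights[, c("departure_delay", "arrival_delay")])
flights_avg <- flights[, c("age", "avg_delay")]
```

Dropping a variable is simpler to interpret, while averaging retains information from both fields at the cost of a less direct real-world meaning.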

Ratings Variable Correlations

Outside of continuous variables, many of the ratings appear to share similar frequency distributions based on the graphs displayed earlier, sparking significant multicollinearity concerns. Our next step to evaluate these potential relationships was to create another correlation matrix.

We can see from the matrix that certain ratings variables have strong positive correlations with each other. If these are included in the model without adjustments, our model may suffer a loss in reliability.

In order to avoid this issue, we elected to combine ratings variables into two groups—based on the degree of correlation—and utilize average ratings from these two groups as model inputs.

Ratings Group 1: Pre-Flight & Wi-Fi

  • In-Flight Wifi Service

  • Departure / Arrival Time

  • Ease of Online Booking

  • Gate Location

  • Online Boarding

Ratings Group 2: In-Flight & Baggage

  • Food and Drink

  • Seat Comfort

  • In-Flight Entertainment

  • Onboard Service

  • Leg Room Service

  • Baggage Handling

  • Check-In Service

  • In-Flight Service

  • Cleanliness

##  Pre_Flight_and_WiFi_Ratings In_Flight_and_Baggage_Ratings
##  Min.   :1.00                Min.   :1.11                 
##  1st Qu.:2.40                1st Qu.:2.78                 
##  Median :3.00                Median :3.44                 
##  Mean   :3.04                Mean   :3.41                 
##  3rd Qu.:3.80                3rd Qu.:4.00                 
##  Max.   :5.00                Max.   :5.00

The two consolidated ratings variables share a weak positive correlation (correlation coefficient = 0.181), indicating that they can be jointly included in our model without violating the collinearity assumption.
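A minimal sketch of this consolidation step is shown below. It uses randomly generated ratings and only three columns per group for brevity; the output names `Pre_Flight_and_WiFi_Ratings` and `In_Flight_and_Baggage_Ratings` follow the summary above, while the input column spellings for the pre-flight group are assumptions.

```r
# Randomly generated 1-5 ratings (hypothetical, not the survey data).
set.seed(3)
n <- 200
rating <- function() sample(1:5, n, replace = TRUE)

# Three columns per group for brevity; the pre-flight spellings are assumed.
pre_flight_cols <- c("Inflight.wifi.service", "Ease.of.Online.booking",
                     "Gate.location")
in_flight_cols <- c("Seat.comfort", "Inflight.entertainment", "Cleanliness")
all_cols <- c(pre_flight_cols, in_flight_cols)

ratings <- as.data.frame(replicate(length(all_cols), rating()))
names(ratings) <- all_cols

# Each consolidated variable is the row-wise mean of its group's ratings.
ratings$Pre_Flight_and_WiFi_Ratings <- rowMeans(ratings[, pre_flight_cols])
ratings$In_Flight_and_Baggage_Ratings <- rowMeans(ratings[, in_flight_cols])

# Correlation between the two aggregates.
group_cor <- cor(ratings$Pre_Flight_and_WiFi_Ratings,
                 ratings$In_Flight_and_Baggage_Ratings)
print(round(group_cor, 3))
```

Because the simulated ratings are independent, the correlation printed here says nothing about the survey data; the sketch only shows the mechanics of averaging and re-checking correlation.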

Ruling out the standard linear model

Before engaging in further analysis, we first established that satisfaction, as a binary categorical variable, cannot be reliably predicted through a standard linear model. The model below and its predicted vs. actual values give a rough demonstration of this roadblock.

## 
## Call:
## lm(formula = satisfaction ~ Gender + Customer.Type + Age + Type.of.Travel + 
##     Class + Flight.Distance, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.7973 -0.0918 -0.0620  0.2440  1.3544 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -3.69e-01   5.04e-03  -73.15  < 2e-16 ***
## Gender           2.92e-03   2.51e-03    1.17     0.24    
## Customer.Type    4.10e-01   3.96e-03  103.42  < 2e-16 ***
## Age              5.34e-04   8.72e-05    6.12  9.4e-10 ***
## Type.of.Travel   4.34e-01   3.61e-03  120.27  < 2e-16 ***
## Class            2.42e-01   3.38e-03   71.81  < 2e-16 ***
## Flight.Distance  8.38e-06   1.45e-06    5.77  8.1e-09 ***
## ---
## Signif. codes:  
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.387 on 95697 degrees of freedom
## Multiple R-squared:  0.386,  Adjusted R-squared:  0.386 
## F-statistic: 1e+04 on 6 and 95697 DF,  p-value: <2e-16

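Although several coefficients are reported as highly statistically significant, the standard linear model is fundamentally ill-suited to modeling binary output. With satisfied and neutral/dissatisfied encoded as 1 and 0 respectively, the linear model predicts unattainable values between the two outcomes. Its fitted values are not even confined to that range: since a residual is the actual value minus the fitted value, and the actual value is at most 1, the maximum residual of 1.3544 implies at least one fitted value at or below roughly -0.35.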
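The same problem can be reproduced on a small simulated example. The sketch below assumes a hypothetical data-generating process in which the probability of satisfaction rises with flight distance; the coefficients are invented for illustration and are unrelated to the model output above.

```r
# Simulated data (hypothetical): satisfaction probability rises with distance.
set.seed(11)
n <- 2000
flight_distance <- runif(n, 100, 4000)
p_satisfied <- plogis(-3 + 0.002 * flight_distance)
satisfied <- rbinom(n, size = 1, prob = p_satisfied)

# Fit an ordinary linear model to the 0/1 outcome.
linear_fit <- lm(satisfied ~ flight_distance)

# Any fitted value below 0 or above 1 cannot be read as a probability.
fitted_range <- range(fitted(linear_fit))
share_impossible <- mean(fitted(linear_fit) < 0 | fitted(linear_fit) > 1)
print(fitted_range)
print(share_impossible)
```

Nothing in ordinary least squares constrains fitted values to the 0 to 1 interval, which is why some predictions drift outside it.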

Rather than a linear model, we will evaluate and prepare the data for use in a logistic regression, which predicts the log odds of satisfaction. This is the dominant approach for modeling binary variables; a linear probability model—which sees use, particularly among social scientists—could potentially serve as a viable alternative, but an evaluation of this method will be reserved for a later date (Allison, 2015). Logistic regression models utilize different assumptions relative to linear models, significantly altering the necessary EDA steps (“Assumptions of Logistic Regression”, n.d.).

Assumptions are altered as follows:

  • Linearity: Rather than a linear relationship between predictors and the dependent variable, logistic regression assumes a linear relationship between predictors and the log odds of the outcome

  • Independence of Errors: Remains as an assumption for both linear and logistic models

  • Homoscedasticity: Not required under logistic regression

  • Normally distributed residuals: Not required under logistic regression

  • Multicollinearity: The absence of strong multicollinearity among predictors remains an assumption for both linear and logistic models
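For comparison with the linear model, a logistic regression can be fit with base R's `glm`. The sketch below uses simulated data under an assumed relationship between flight distance and satisfaction; the coefficients are hypothetical and the fit is not the model this report will ultimately estimate.

```r
# Simulated data (hypothetical): log odds of satisfaction are linear in distance.
set.seed(11)
n <- 2000
flight_distance <- runif(n, 100, 4000)
satisfied <- rbinom(n, size = 1, prob = plogis(-3 + 0.002 * flight_distance))

# Logistic regression: family = binomial uses the logit link by default.
logit_fit <- glm(satisfied ~ flight_distance, family = binomial)

# type = "link" returns the linear predictor (log odds);
# type = "response" returns predicted probabilities.
log_odds_hat <- predict(logit_fit, type = "link")
prob_hat <- predict(logit_fit, type = "response")

print(coef(logit_fit))
print(range(prob_hat))
```

The predicted probabilities are the inverse-logit of the linear predictor, so they always fall strictly between 0 and 1.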

Testing Linearity with log odds

Unlike a standard linear regression, which assumes that independent variables have a linear relationship with the dependent variable, logistic regression assumes that independent variables have a linear relationship with the log odds of the outcome (“Assumptions of Logistic Regression”, n.d.).

Odds represent the number of favorable outcomes divided by the number of unfavorable outcomes. Put differently, if “p” represents the probability of a favorable outcome, Odds = p/(1-p). Log odds take the natural log of the odds, which can be expressed as ln(p/(1-p)) (Agarwal, 2019).
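As a short worked example, the snippet below converts an illustrative probability of 0.8 (chosen arbitrarily) into odds and log odds, and checks the result against base R's `qlogis` and `plogis` functions.

```r
# Illustrative probability of a favorable outcome (arbitrary choice).
p <- 0.8

odds <- p / (1 - p)   # 0.8 / 0.2, i.e. 4 favorable outcomes per unfavorable one
log_odds <- log(odds) # natural log of 4, approximately 1.386

# qlogis() is base R's logit function; plogis() is its inverse.
print(c(odds = odds, log_odds = log_odds, qlogis = qlogis(p)))
print(plogis(log_odds)) # recovers the original probability
```

A probability of 0.5 corresponds to odds of 1 and log odds of 0, so positive log odds indicate that satisfaction is more likely than not.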

We can use a visual test to examine whether or not this assumption holds true for continuous variables. While it is not sensible to compute log odds for individual data points, we can group continuous variables into discrete buckets—calculating the average log odds for each—to examine whether or not they might satisfy this assumption.
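One way this bucketing could be carried out is sketched below on simulated data. The number of buckets, the use of equal-width bins, and the choice to take the log odds of each bucket's satisfaction share are all assumptions made for illustration; the report's own binning may differ.

```r
# Simulated data (hypothetical): log odds of satisfaction are linear in distance.
set.seed(21)
n <- 5000
flight_distance <- runif(n, 100, 4000)
satisfied <- rbinom(n, size = 1, prob = plogis(-2 + 0.001 * flight_distance))

# Group the continuous variable into 10 equal-width buckets.
bucket <- cut(flight_distance, breaks = 10)

# Share of satisfied passengers and mean distance within each bucket.
share_satisfied <- tapply(satisfied, bucket, mean)
bucket_mean <- tapply(flight_distance, bucket, mean)

# Empirical log odds per bucket: ln(p / (1 - p)) of the satisfaction share.
empirical_log_odds <- log(share_satisfied / (1 - share_satisfied))

binned <- data.frame(bucket_mean, share_satisfied, empirical_log_odds)
print(round(binned, 3))

# A roughly straight line here would be consistent with the linearity assumption:
# plot(bucket_mean, empirical_log_odds)
```

Buckets in which every passenger is satisfied, or none is, produce infinite log odds, so sparse buckets may need to be merged before plotting.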

Out of the graphs above, it appears that only flight distance has a roughly linear relationship with log odds of satisfaction. Age appears to have a parabolic relationship, peaking in the middle; some sort of aggressive transformation method may be required to reach a linear relationship. Meanwhile, log odds for both delay statistics quickly disperse in both directions as they increase (likely in part due to the limited frequency of higher durations), making it difficult to conclude with certainty that a linear relationship exists.

In-flight and baggage ratings have a strikingly linear relationship with log odds; meanwhile, pre-flight and wi-fi ratings appear to have a significantly looser connection, with a potential dip in log odds for average ratings. Based on this visual test, the in-flight aggregate appears to fulfill the linearity assumption, while the evidence for the pre-flight aggregate is far less clear.

Conclusion

Following our EDA attempts, our research questions remain relatively intact.

  1. A logistic regression model can help us understand the effects of certain variables on the likelihood of passenger satisfaction.

  2. Efforts to test for log odds linearity and multicollinearity have helped us prepare for a model that minimizes predictive bias. Following an initial logistic regression model, VIF analysis and other strategies can be used to expand upon multicollinearity testing and examine statistical significance with hypothesis testing.

  3. It seems like continuous, ordinal, and categorical variables can all be feasibly combined as inputs in our predictive model. However, questions regarding encoding still remain for the class variable. The simplest approach would likely be to combine Eco and Eco Plus into a single category when encoding, turning class into a binary variable for modeling purposes and limiting necessary assumptions.

  4. In addition to our previous questions, the issue of interpretability emerges. Our intent is to generate a predictive model that can be applied by firms in the airline industry; however, that requires the model’s various factors to correspond to real-world adjustments. For example, we may have to exponentiate model coefficients for more intuitive explanations (Allison, 2015). Variable aggregates and other potential transformations pose separate issues on this front as well.
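The binary class encoding and the exponentiation of coefficients mentioned above can be sketched together. The class proportions, the simulated effect size, and the names `travel_class` and `is_business` are hypothetical; the snippet shows the mechanics only and is not the final model.

```r
# Simulated class variable (hypothetical proportions, not the survey data).
set.seed(5)
n <- 3000
travel_class <- sample(c("Eco", "Eco Plus", "Business"), n, replace = TRUE,
                       prob = c(0.45, 0.07, 0.48))

# Merge Eco and Eco Plus: 1 = Business, 0 = Eco or Eco Plus.
is_business <- as.integer(travel_class == "Business")

# Hypothetical outcome in which Business passengers are more often satisfied.
satisfied <- rbinom(n, size = 1, prob = plogis(-1.5 + 2 * is_business))

fit <- glm(satisfied ~ is_business, family = binomial)

# Coefficients are on the log odds scale; exponentiating gives odds ratios.
odds_ratios <- exp(coef(fit))
print(round(odds_ratios, 2))
```

An odds ratio above 1 for `is_business` would be read as higher odds of satisfaction for Business passengers relative to the combined Eco group, which is often easier to communicate than a raw log odds coefficient.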

Citations

Klein, T. J. (2020). Airline Passenger Satisfaction. Kaggle. https://www.kaggle.com/datasets/teejmahal20/airline-passenger-satisfaction?select=train.csv

Lutz, A., & Lubin, G. (2012). Airlines Have An Insanely Small Profit Margin. Business Insider. https://www.businessinsider.com/airlines-have-a-small-profit-margin-2012-6

Hardee, H. (2023). Frontier reports lacklustre Q3 results as it struggles in ‘over-saturated’ core markets. FlightGlobal. https://www.flightglobal.com/strategy/frontier-reports-lacklustre-q3-results-as-it-struggles-in-over-saturated-core-markets/155561.article

vif: Variance Inflation Factors. (n.d.). R Package Documentation. https://rdrr.io/cran/car/man/vif.html

Allison, P. (2015, April 1). What’s So Special About Logit? Statistical Horizons. https://statisticalhorizons.com/whats-so-special-about-logit/

Assumptions of Logistic Regression. (n.d.). Statistics Solutions. https://www.statisticssolutions.com/free-resources/directory-of-statistical-analyses/assumptions-of-logistic-regression/

Agarwal, P. (2019, July 8). WHAT and WHY of Log Odds. Towards Data Science. https://towardsdatascience.com/https-towardsdatascience-com-what-and-why-of-log-odds-64ba988bf704